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Abstract 

In this paper, we derive a randomized version of the Mirror-Prox method for solving some 
structured matrix saddle-point problems, such as the maximal eigenvalue minimization prob- 
lem. Deterministic first-order schemes, such as Nesterov's Smoothing Techniques or standard 
Mirror-Prox methods, require the exact computation of a matrix exponential at every iteration, 
limiting the size of the problems they can solve. Our method allows us to use stochastic approx- 
imations of matrix exponentials. We prove that our randomized scheme decreases significantly 
the complexity of its deterministic counterpart for large-scale matrix saddle-point problems. 
Numerical experiments illustrate and confirm our theoretical results. 

Keywords: stochastic approximation, Mirror-Prox methods, matrix saddle-point problems, eigen- 
value optimization, large-scale problems, matrix exponentiation 

1 Introduction 

Large-scale semidcfinitc programming attracts substantial research efforts nowadays. A vast set of 
applications can be modeled as such optimization problems, and many strategies have been studied 
theoretically and implemented in excellent softwares. 

Arguably, general purpose semidcfinitc methods suffer from an intrinsic drawback. They forbid 
themselves, for the sake of generality, to exploit explicitly some structural features of the particular 
instance they are given to solve, hampering the resolution of very large-scale problems. 

As a result, we are witnessing the development of special-purpose algorithms, designed for 
particular subclasses of semidcfinitc optimization problems, where the utilization of their specific 
structure is instrumental for aiming at large size problems. In this paper, we are addressing the 
problem of minimizing the maximal eigenvalue of an affine combination of given symmetric matri- 
ces, plus a linear function of the coefficients of this affine combination. Different strategies have 
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been devised to deal specifically with the maximal eigenvalue minimization problem. Among the 
first investigated techniques, Bundle methods were introduced in [HR001 IOus00| . and subsequently 
refined in a number of further papers. Theoretical results on Bundle methods concern mainly 
asymptotic convergence properties; to the best of our knowledge, no complexity guarantees have 
been obtained so far in this direction. 

Another strategy has been discovered in [Nes07| when Nesterov showed how his Smoothing 
Techniques can be specialized to the maximal eigenvalue minimization problem. As a result, he 
obtained worst-case complexity guarantees for his method: if e > is the desired absolute accuracy 
on the objective value, Ay, . , . , A m are real symmetric n x n-matrices, A m C R m is the (m — 1)- 
dimcnsional simplex, and A max (^4) denotes the maximal eigenvalue of any real symmetric matrix 
A, his algorithm solves the problem 



in 0((n 3 + mn 2 ) max.,- A max (| A/ |)-\/ln(n) ln(m)/e) elementary operations, where |^4| = y/A 2 for 
any real symmetric matrix A. Almost simultaneously, the paper |Nem04a] develops a Mirror-Prox 
method, which can be particularized as well for our problem, and obtains equivalent complexity 
results. Two papers by Warmuth et al. [TRW05| IWK06] present a scheme - called the Matrix 
Exponentiated Gradient Update method - that can basically be applied to problem fl}. A form of 
this algorithm was independently discovered by Arora and Kale |AK07] . The methods of Arora et 
al. and Warmuth et al. essentially reduce to a subgradient method (provided that we adapt them 
to our problem), but with a worse complexity guarantee: in order to find a solution to problem 
dU) with absolute accuracy e, these methods need C(maxj X^^QAj]) ln(n) / e 2 ) iterations. Each of 
these iterations requires the computation of a matrix exponential and further operations with a 
cost not exceeding 0(mn 2 ). 

In a nutshell, the methods introduced in [AK07| ITRW05| IWK06] present the same compu- 
tational bottleneck as Smoothing Techniques and the Mirror-Prox method when applied to our 
problem: at every iteration, all these schemes require the determination of a symmetric matrix's 
exponential. Several efforts have been carried out to reduce the iteration computation cost. In 
|d'A08aj . d'Aspremont analyzes the possibility of using approximate gradients, and thereby approx- 
imate matrix exponentials in Nesterov's Smoothing Techniques. In | JNT08] . Mirror-Prox methods 
for variational problems where extended to situations where only some stochastic information is 
available from the instance to solve. These methods were particularized to the maximal eigenvalue 
minimization problem where all the input matrices share the same block-diagonal pattern. Albeit 
the problem is completely deterministic, an artificial randomization was introduced in the oracle 
of the method, which reduced the iteration cost while retaining some probabilistic guarantees on 
the output of the algorithm. Finally, Arora and Kale |AK07| obtain, by approximating the rows 
of exjp(X/2), a substitute for the exact Gram matrix exp(X), where X is some real symmetric 
n x 7i-matrix. The rows of exp(X/2) are approximated by projecting an appropriate truncation of 
the exponential Taylor series approximation on, roughly speaking, C(l/e 2 ) random directions. 

In this paper, we apply the general results of |JNT08] to analyze another randomization strategy 
for computing matrix exponentials, which is also based on a vector sampling and on an appropriate 
truncation of the exponential Taylor series. Whereas we consider the same number of terms in 
the Taylor series approximation of the matrix exponential as Arora and Kale |AK07| do, we can 
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significantly reduce the number of required random vectors: roughly speaking, we project the 
truncated Taylor series on C(l/e) random directions. 

The approximation strategy developed in this paper proves to be theoretically efficient for large- 
scale problems. In theory, it outperforms all its competitors on a reasonably large set of instances, 
described by the size of the input matrices, their number, their sparsity, and the desired accuracy. 
Our theoretical conclusions arc demonstrated by numerical evidence: for solving problems (up to 
a relative accuracy of 0.2%) that involve a hundred matrices of size 800 x 800, the Mirror-Prox 
method equipped with our randomization procedure requires on average about, roughly speaking, 
half of the CPU time needed by the Mirror-Prox method with exact computations. 

The paper is organized as follows. Section 2 contains the necessary notational conventions 
and a brief recall on existing results on Mirror-Prox methods for general convex problems with 
approximate oracle. We particularize these considerations in Section 3 to slightly structured matrix 
saddle-point problems and we analyze the stochastic exponential approximation strategy briefly 
described above for computing an approximate oracle. In Section 4, we derive the complexity of 
solving the maximal eigenvalue minimization problem. In Section 5, we test our method for solving 
large-scale eigenvalue optimization problems, comparing its efficiency with the provably best purely 
deterministic method in terms of worst-case complexity. 

2 Mirror-Prox methods with approximate first-order infor- 
mation: a review 

Let £ be a Euclidean space with inner product (•, •). We endow E with a norm ||-||, which may 
differ from the one that is induced by this inner product. The conjugate norm \\-\\ t to ||-|| is defined 
as: 

IML : = max[(w,x) : ||a;|| = 1] . 

x£E 

2.1 Variational inequalities and saddle-point problems 

Variational inequalities. Let Q be a non-empty convex compact subset of E, and let F : Q —> E 
be a Lipschitz continuous monotone mapping with Lipschitz constant L > 0: 

\\F(z) - F(z')IL <L\\z- z'\\ V z,z' G Q 
(F(z)-F(z'),z-z') >0 Vz,z'eQ. 

The variational inequality associated with the set Q and the operator F reads as follows: 

find z* e Q such that (F(z), z* - z) < for all z E Q. (2) 

In the sequel, in order to measure the inaccuracy of a point z € Q as a candidate solution to ([2]), 
we use the dual gap function 

e(z) := max (F(z), z — z) . 
For z G Q, we clearly have e(z) > 0, and e(z) = if and only if z solves ©. 
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Saddle point problems. Assume that E := E\ X E% for Euclidean spaces E\ and E2, and 
that Q := Qi X Q2 is non-empty with two convex compact sets Qi C E\ and Q2 C E<z- Let 
(f> : Q\ x Q 2 — > M be a convex-concave function. We restrict ourselves to functions that are 
differcntiablc with Lipschitz continuous gradient. The function (/>(■, •) is associated with the saddle- 
point problem 



min max0(x,j/). 



(3) 



Due to the standard Minimax Theorem in Convex Analysis (see Corollary 37.3.2 in |Roc70j ) . we 
have the following pair of primal-dual convex optimization problems: 



mm 



(x) := max <p(x, y) 



max 

yeQ2 



<Kv) 



mm <j>(x, y) 



It is well known that the solutions to the saddle point problem (J3)) are exactly the pairs (a;* , ) 
comprised of optimal solutions to the above two optimization problems, and that these pairs are 
exactly the solutions to the variational inequality given by Q = Qi x Q2 and the monotone operator 



F(x,y) 



d<j>(x, y) # d<f>(x, y) 
dx ' dy 



E 1 x E 2 . 



We quantify the accuracy of a candidate solution z = (x, y) € Q to the saddle point problem 
by the value of the corresponding duality gap 



^sad 



(z) :- 



(x) — <p(y) — max <p(x, v) — min (f>(u, y). 



Due to the convex-concave structure of (f>, any point z = (x, y) £ Q\ x Q2 constitutes an e sad (z)- 
approximate solution to the variational inequality that is associated with Q = Q\ x Q2 and the 
above F: 

e sad (x, y) = max [0(x, y) - </>(x, y) + (f>(x, y) - <f>(x, y)] 
(x,y)eQixQ 2 

— max (F(z), z — z) — e(x, y). 



> max 

(x,y)£QixQ 2 



d0(x,y) 
dx 



9<l>(x,y ) 
dy 



,y-y 



2.2 Mirror-Prox algorithm: preliminaries 

In its basic form, the Mirror Prox (MP) algorithm is aimed at solving variational inequalities on 
a convex compact subset Q of a Euclidean space E equipped with a norm || • ||. The setup for 
the algorithm is given by a distance-generating function (d.-g.f.) u: Q — > K which possesses the 
following properties: 

o u is continuous and convex on Q. In particular, the domain Q° = {x € Q : duj{x) ^ 0} of the 
subdifferential of u) is nonempty. 

o wis regular on Q°, i.e., the subdifferential duj(-) admits a continuous selection uj'(-) on Q°. 
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o The function u) is strongly convex modulus 1 with respect to || • ||: 

(cj'(z) - iu'(y),z -y)>\\z-y\\ 2 V^e Q°. 

In the sequel, we refer to the latter property as the compatibility of the d.-g.f. uj(-) and the 
norm | • 1 1 . 

Furthermore, we suppose that we choose u such that we can easily solve problems of the form: 

min [uj(z) - (e, z)\ , e £ E. (4) 

Remark 2.1 TTie optimal solution z e to clearly exists, is unique by continuity and strong con- 
vexity of u, and belongs to Q° (indeed, by evident reasons e £ du(z e )). From regularity ofuj(-), it 
immediately follows that 

(u'{z e )-e,z-z e ) >0 VzeQ. (5) 
A d.-g.f. lu(-) gives rise to several entities: 

o The ui-center z u := argmin z6 Q to(z) of Q. 

o The Bregman distance V z (w) = ui(w) — w(z) — (td'(z),w — z), where z £ Q° and w £ Q. By 
strong convexity, we have: 

V z (w)>^\\w-zf V(zeQ°,weQ). (6) 
o The u -diameter VtofQ, which is defined as: 



n := 



, 2m&xV z ^(z) < i/2 maxwfz) — mmcj(z) I, 
V zeQ W " Y \zeQ y ' zeQ y 'J' 



where the concluding inequality follows from the fact that V z u(z) < w(z) — u>(z u ) due to 
(lj'(z u ), z — z u ) > for every z £ Q, see ([5]). Further, by ((SJ), we have: 

||tu — z u \\ < ft for any w £ Q, whence D := max \\w — z\\ < 20. (7) 

w,zeQ 

o For parameter z £ Q° , we define the Prox-mapping as: 

Prox z (£) = argmin [V z (w) + (£,w)] : E -»■ Q° 

w£Q 



(the argmin in question indeed belongs to Q°, see Remark 12.1 
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2.3 Mirror-Prox algorithm with noisy first-order information 



The prototype MP algorithm |Nem04a] is aimed at solving variational inequality © when exact 
values of F arc available. In this paper, we use a modification of the original MP scheme, the 
Stochastic Mirror-Prox (SMP) algorithm proposed and investigated in |JNT08j . which operates 
with noisy estimates of F. Specifically, the algorithm has access to a Stochastic Oracle: at the t-th 
call of the oracle, z t G Q° being the query point, the oracle returns an estimate F^ t (z t ) of F[zt). 
Here, £t is the f-th realization of the "oracle's noise" , which is modeled as a random vector £, and 
F^(z) is a Borel function of £ and z. We assume that the realizations (£t) t>1 of the random vector 
£ are independent. From now on, we set £[ t j = (£i,£2, ■ ■•,£*)■ The algorithm is as follows. 

Algorithm 1 [Mirror-Prox method with noisy first-order information] 



Choose the number of iterations T. Set zq = z u G Q°. 
for 1 < i < T de- 
Given zt-i G Q° 1 choose (a deterministic) 74 > such that: 



4: Call Stochastic Oracle with query point Zt—i and receive rjt '■= F&t-i 0&t-l)- 
5: Set w t = Prox Zt _ x (7*1/*) G Q°. 

6: Call Stochastic Oracle with query point w t and receive Q := F^ 2t (wt). 
7: Set « t = Prax at _ 1 (7 t Ct) G Q°. 
8: end for 

9: Return z T := (j2t=i1t) Y^t=iltw t - 

Note that z t is a deterministic function of £[2t], while w t is a deterministic function of £[24-11- In 
order to show expected convergence of Algorithm [TJ we need to define the following quantities: 

fizt-i ~ {^2t-i(^t-l)l £[2t-2]} ~F(zt-l), 

fj, Wt := E& t |% t (wt)|£[2t-i]} -F{w t ), 

f " "1 \ ) 

F t2t-l( Z t-l) -%2t-l |-^2t-l( Z t-l)l £[2t-2]j , 



(^)-%t{-F&t(«^)IC[2t-l]}. 



where 1 < t < T. Note that we define E 6 \P^{zq)\ £ [0] } := % {%(z )}. Note also that t7 Zt _, 
and CTujj are martingale differences. The following result is proven in the Appendix. 

Theorem 2.1 Let 

^ T _ £ + eLi t*o-«hIU + Ef=i {7*^11^11* + 27? (I K* -^t-ill^ + ll^t -^ t -ill«)} 

(10) 



£t=i 7* 



Moreover, for an operator F that is associated with saddle-point problem |3]J. the following inequality 
holds: 

3 Mirror-Prox algorithm for matrix saddle-point problems 

3.1 Matrix saddle-point problems 

The problem of primary interest in this paper is the Eigenvalue Minimization problem 

Opt = min [X max (A(x) + B) + (c, x}) , (11) 

xeQi 

where: 

o Qi is a convex compact subset of the space E\ = W m equipped with the standard inner 
product (x, y) = x T y; 

o A{x) = X)"=i x iAi is a linear mapping from M. m into the space E 2 — S n of symmetric n x n 
matrices (so that Ai, A m £ S n ), and B £ S n ; 

o ^max(^4) stands for the maximal eigenvalue of a symmetric matrix A; 

o c e K m . 

We equip E2 = S n with the Frobenius inner product (X,Y) F = Tr(XY). Denoting by the 
spectahedron {Y £ S n : Y y 0,Tr(y) = 1} and observing that A max (A) = max Tr(YA), we can 

reformulate (JTTJ) as the following bilinear saddle point problem: 

Opt := min max <j>(x, Y), 4>{x, Y) = (B, Y) F + (A(x), Y) F + (c, x) . (12) 

The associated operator F is: 



F(x,Y) 



F X (Y) := = A*(Y) + c;F Y (x) := ^-^fl = - A (x) - B 



(13) 



where the linear mapping A*{Y) = [Tt(A 1 Y);Tt(A 2 Y); ...;Tr(A m Y)} : S n R m is conjugate to 
the mapping x 1— > A(x). 

We are about to solve the Eigenvalue Minimization problem by applying to (fT2"| the Mirror- 
Prox algorithm. While the problem is fully deterministic, we intend to use an appropriately con- 
structed Stochastic Oracle, which is computationally cheaper than the exact deterministic oracle; 
our ultimate goal is to demonstrate that the resulting SMP algorithm significantly outperforms its 
deterministic counterpart in a meaningful range of problem's sizes. 

We start with presenting the algorithm's setup. 
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3.2 Algorithm's setup 

We assume that the space E\ = R TO is equipped with a norm || ■ \\ x , the conjugate norm being 
|| • \\ x ,*, and that Q\ is equipped with a d.-g.f. u> x {x) that is compatible with || ■ || x . We denote by 
cc Wx and VL X the w^-center of Q\ and the w x -diameter of Q\, respectively; see Section 12.21 

We equip the space E 2 = S n with the trace- norm ||M^||y := X)"=i 1^*0^0 1 > wnere the vector 
\(W) = [X 1 {W);X 2 (W);...;X n {W)} consists of the eigenvalues Xi(W) > ... > X„(W) of W G 5™. 
As it is well-known, the conjugate norm is the usual spectral norm || \WW)\. 
Further, we equip the spectahedron Qi := A^f with the matrix entropy d.g.-f.: 

n 

uj y {Y) = Ent(F) :=^A i (y)ln(A i (Y)). (14) 
i=i 

As shown in [Nes07j . see also jBTN05j . this d.-g.f. is compatible with || • ||y, and as it is immediately 
seen, the corresponding center and diameter of Q 2 are as follows: 

Y UY = -I n ; n Y = y/2 ln(n). (15) 

Finally, we equip the embedding space E — E\ x E2 of the domain Q — Qi x Q 2 of (TT^J) with the 
norm 



1 „ ,,o 1 

2| 



ll(*.*1ll = ^11* + ^H y lly> ( 16 ) 



implying that the conjugate norm is 



\\(x,Y)\U = ^2\\x\\l* + n Y \\Y\\ Y ^. (17) 
The domain Q = Qi x Q 2 of (fT2j) is equipped with the d.-g.f. 

it is immediately seen that this indeed is a d.-g.f. for Q compatible with || ■ ||, and that the 
corresponding diameter of Q is fl = while the w-center of Q is z u = (a; Wx , Y^ Y ) . 

Finally, let C be (an upper bound on) the norm of the linear mapping x <-> A(x) : E\ — > Ei 
induced by the norms || • and || ■ ||y * on the argument and the image spaces: 

Vx £ E x : \\A(x)\\ Y ,* < C\\x\\ x . (19) 

It is immediately seen that the affinc monotone operator F associated with (|12|) (see (|13p) satisfies: 

V(z,z' e E = Ei x E 2 ) : \\F(z) - F(z')\\* < L\\z - z'\\, where L := Vt, x Vt Y C. (20) 



3.3 Randomized Mirror-Prox method for (1121) 
3.3.1 Randomization: motivation and strategy 

With the outlined setup, when applying the deterministic MP algorithm (i.e., Algorithm [T] with 
precise information: F^ t = F) to the variational inequality associated with the saddle point problem 
(|12p . the computational effort at iteration t is dominated by the necessity 
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(A) to compute the value of F at two points, namely at the points z = z t -\ = (x,Y) <G Q° and 
w = w t e Q°; 

and 

(B) to compute the value of the prox-mapping Prox 2 (£) = argmin„, e Q + (C — ^'{z), w — z)] 
at two different points C, E E. 

With our d.-g.f. uj that is "separable" with respect to the x- and to the y-component of z, task 
(B) reduces ("at no cost") to solving the two optimization problems: 

(a) argmim^Q, [uj x (u) + u T [C, x - uj' x (x)]] , . 

(b) Py(Cr) := argmin yeQ2 [Ent(V) + Tr (y [Cy - Ent'(F)] )] 1 ' 

with C x e R m = Ei and £y € <S„ = E 2 readily given by C, specifically, £ = (Sl~ 2 C, x , fiy 2 (V). In the 
sequel, we assume that (a) is easy to solve. The solution of (b) can be written explicitly (see, e.g., 
jBTN05j ). Specifically, since z € Q°, we have Y E Q° 2 = {Y E S n : Y ^ 0, Tr(Y~) = 1}, and the 
latter set clearly is the set of all matrices of the form 

H W = T^£vTy vc ~ s "' (22) 

Assuming that we have at our disposal a representation Y = H(V) with V e S n , the solution 
to (b) is just H(—Cy + V). In other words, when parameterizing points Y G Q% according to 
Y = H(V), prox-mapping (j!?Tj) becomes trivial - it reduces to a matrix addition. The y-componcnts 
of the points w t = (u t ,Wt) and zt = (xt,Y t ) generated by the deterministic MP are of the form 
W t = Py^A^Yhh) and Y t = fy t -i( fi y[Ct]r), where r] U Q € K m x5" are given (see Algorithm UJ. 
When using parametric representations W t = H(Ut) and Y t = H(Vt), the matrices U t and V t are 
easy to update: U t = Vt-\ — f^y[?7t]Y and V t = Vt-i — [CtW, respectively, with r] t , (t as defined in 
Algorithm [T] Thus, when representing Wt and Y t by their "matrix logarithms" Ut and Vt, it looks 
as if the computational effort per step of MP as applied to (fT2j) were dominated by the necessity to 
resolve task (A), and in task (B) — to solve the problem (pHI a) alone. This impression, however, 
is not fully true. Indeed, looking at (IT51) . we observe that — while computing the y-component 
Fy(x) = — A(x) — B of F at a point z = (x, Y) needs the knowledge of x only and is independent of 
how y is represented — computing the y-component F X (Y) = [Ty(AiY); Tr(A m y)] + c of F(z) 
seemingly requires the explicit representation of Y. This latter observation makes it necessary to 
solve explicitly problems (|21lb). or, which is the same, requires computation of the value of H at 
a given point V. The related computational effort is 0(n 3 ) (the arithmetic cost of an eigenvalue 
decomposition of V), which, depending on the problem's structure and sizes, can by far dominate 
all other computational expenses at an iteration. The goal of this paper is to demonstrate that one 
can avoid the explicit solution of "troublcmaking" problems (HU6), and use instead the easy-to- 
updatc "logarithmic" representations at the cost of a randomized computation of F x (-). The idea 
of randomization is as follows: assume that we are given a "matrix logarithm" V of Y e Q%, so 
that y = H(V). We need to compute a randomized estimate of the vector F X (Y) = A*(Y) + c. 
that is, of: 

A*(Y) = [Tr^y); . . . ; Tr(A m Y)} = —-J-—^^ exp{y}); . . . ; Tr(A m exp{V})]. 
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Imagine for a moment that we can multiply vectors by the matrix exp{V/2}. Then, generating 
a sample £ of A independent vectors £ s ~ Af(0,I n ), 1 < s < A, and setting \ s = exp{V72}£ s , 
s = 1, 2, A, we have: 

|y ^[[xT A lX s ; • • ■ ; [x1 T ^mXl| = [Tr(Ai exp{V}); Tr(A m exp{F})], 



and: 



E {^EMv|=Tr(ex P m), 



so that we can use the random vector 

9*00 ■= E ^^;;^ AmXS] + c, X °:=eMV/2K% (23) 
i LX J X 

as a random (biased!) estimate of F X (H(V)). The last strategic question to be addressed is how 
indeed to compute, given V and a vector £, the vector x = exp{y/2}£. We propose to build a high 
accuracy approximation \ to x by setting 



with J large enough to guarantee a desired accuracy, and to compute the terms Vj = jj-(V/2) J '£ 
by successive matrix- vector multiplications: — £, Vj+\ = 9(j+i) V v j- We then merely replace in 
(|23p the vectors x s with their approximations % s , thus getting an estimate 

^.g.Wf;,^,, (25) 
E s =iixTx s 



of <?^(V). We also set 



H ? (y) 



A' 



E^ T ; 



EK s [x s ] t ]; (26) 



note that 7-^(V) £ Q2 = can be considered as a random estimate of H(V) (see ([22]) ). and that 
g s (V) = F X (H S (V)). 



3.3.2 Randomized algorithm 

Implementing the outlined randomization strategy with the setup presented in Section 13.21 Algo- 
rithm [1] becomes as follows: 

Algorithm 2 [Randomized Mirror-Prox method applied to matrix saddle-point problem (fT2j) ] 
1: Choose the number of iterations T, the sample size A, and a sequence of positive integers 
Jti 1 < t < T. Generate 2T independent samples £1, £2, £2T, each of them comprised of A 
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independent realizations £j ~ A/"(0, /„), 1 < s < N. 



Set xo = x" 1 and let Vo £ S n be the all zero matrix, 
for 1 < t < T do 

Given (xt-i,Vt-i), choose (deterministic) -f t > such that (cf. (|20[) . (|19p) 
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7t < = -f • (27) 

5: Compute the approximation 

J&*-i(a:t-i,Vt_i) = [se«-i(^-i); -s-^-i)] 

of F(x t _i,'H(^_i)) = [^*("H(V^_i)) + c;-B->l(a;t-i)], where (fe^G) is as explained in (25j). 
with J t in the role of J. 
6: Set 

.T t = argmm {(Tt^fl&t-iC^-i) _ u ' x (xt-i), x) + u x (x)} 
V t = Vt-i + Ttfiy (B + A(x t -i)) ■ (28) 
7: Compute the approximation H t := %{ 2t (Vt) of 'H(Vt) and the approximation 

F^(xt,V t ) = [g i2t (V t ):-B-A(x t )] 

of F{x t ,'H{V t )), where %£ 2t (Vt) and <?£ 2t (-) are as explained in (|26| and (|25)l . respectively, and with 
J t in the role of J. 
8: Set 

a; t = argmm {(7t^<fe«04) - <4(zt-i), x) + u x (x)} 

V t = Vt-i + 7t«y (B + A(xt)) . (29) 

9: end for 

10: Return x T := \Y%=i7t) Ylt=i 7t*t- 



3.3.3 Convergence and complexity analysis 

Regularity assumption and preliminaries. In order for Algorithm [5] to be well behaved, we 
need certain additional assumption on (Ei, ||-|| „), specifically, the one of regularity with certain 
parameter k = Ke 1 ■ Instead of stating this notion here in full generality (this is done in Section 
[X]of the Appendix), let us just hint that the property has to do with "good behavior" of sums of 
martingale differences taking values in E\ and list the regularity parameters for the most important, 
in regard to applications, pairs (E\, \\-\\ x „). Specifically, denoting by | • \ p the spectral £ p -norm on 
the space M. m of m x m matrices, that is, \A\ p = \\<j(A)\\ p , where <j{A) is the vector of singular 
values of A, the following holds true (from now on, all C(l)'s are appropriate absolute constants): 

(!) If 1 < p < 2 and either (E u ||-||J = (R m , \\-\\ p ), or (E u ||-||J = (M m , \ ■ \ p ), then the 
regularity parameter of {E\ 1 \\-\\ x t ) is equal to 1 when p = 2, is bounded from above by 
—^j when p > 1, and is bounded from above by 0(1) ln(m + 1). 
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From now on, if the opposite is not explicitly stated, it is assumed that (E 1 , \\-\\ x J is K-rcgular. An 
instrumental role in the convergence analysis of Algorithm [2] is played by the following fact (proved 
in Appendix). 

Proposition 3.1 Let {E\, \\-\\ x „) be n-regular for some k. With F given by i!3\) and g^(-) given 
by l2Sfy , one has for every V £ S n : 



(a) \\^{g^V)} - F X {H{V))\\ X ,, <0{1)L^N- 



(b) E e {exp { 



^N\\9i(V)-^{m(V)}\\*,* 



}}<cxp{l}. 



(30) 



Another component of our analysis is the following simple statement (recall that is the usual 

spectral norm on 5™): 



Proposition 3.2 Let W <E S n and J > cxp(2)|| W^||y,*. Then, 

< exp{— J}. 



exp{W} - E -T 



3=0 



(31) 



y* 



This result can be proved by applying the same arguments as in the proof of Lemma 6 in |AK07j . 



Convergence analysis. Proposition ^ . 21 shows that in order to approximate the matrix exponent 
exp{W} by its Taylor polynomial within accuracy e -C 1, it suffices to take for the degree of the 
polynomial the number J = 0(1) ln(l/e)||T-v"||y >t , so that J is "nearly independent" of e. Now, 
when e is really small - like 10 -16 or even 10 -100 - any e-approximation of the matrix exponent 
is, "for all practical purposes," the same as the matrix exponent itself. Assuming that the choice 
of J indeed ensures "really small" inaccuracies in the approximation of the matrix exponent, we 
have all reasons to undertake a simplified convergence analysis of Algorithm [2j where we neglect 
the difference between the quantities g^(-) as given by (|23|) and their estimates g^(-) (defined in 
(|25p ). Or, alternatively formulated: we analyze the idealized version of the algorithm with g^(-) in 
place of <?£(•)• 

The convergence analysis of the idealized algorithm is as follows. Let 1 < t < T, and let j t 
satisfy (f2~T)) . Note that in the notation of Algorithms [1] and [2] and of definitions © we have: 

- (xt- lt H(Vt-i)), w t = (x t ,H(V t )), 
f*t-i = [% f -i {flC»-i(Vt-i)|C[»-2]}--F*fe-i);0] =: [m^.jO], 
(j, Wt = pE &( {3^(^)1^-1]} -F x (w t );0} =: [fil t ;0], 
CT Zt _! = [g^-dVt-l) {9S2t-i(Vt-i)\£[2t-2}} ;0] =: K^o], 
° wt = [fe 4 (^)-E{^ 2t (y t )|C [2t _i]};0] =: K f ;0]. 

By (|30la) combined with (|17[) . we obtain: 



E 



% t {H^IUl^t-i]} = n x E i2t {IIm^IU,*!^-!]} < o(i)n x c^EN-\ 



(32) 
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whence also: 

% r] {EL7tlKJ*} = ^E S[2T] {eL^IKIU*} <0(i)fW^v-i£f =1 



7«; 



(33) 



^pt] {ELi7t 2 ||Mwt =^x E e[2T] {ELi7 f 2 ||M«> t -M^-illx,*} 

<0(l)0^iV- 2 Ef =1 7t 2 . 
Further, inequality (|30l b) implies: 

{IK., III,* l^at-a]} < 0(1)£*k/N, E 6t {||<|| 2 ,^[2 t -i]} < 0(1)C 2 k/N. (34) 
Moreover, by the definition of a? and cr^, , we have: 

% t - 1 K t _J^-2]} = 0, % B {<|fpt- 1 ]} = 0. (35) 
Since (£1, || ■ || ^ t ) is K-regular, relations (jMJ) and (|3"5j) imply by Proposition 3 of [Ncm04b that: 



T 



which results in: 
E C[2T] { ELi7tCTto t J = ^r% T] I ELi7*<^ 

Besides this, ([3~4")l implies that: 



<v x je (12T j eLi^s 



(36) 



^ 2 % T1 



. t=l 



E7 t 2 ll<-<_J 2 ,*} 



< 



(37) 



Combining (|33|) . (|36|) . (|37]l . and taking into account that with our setup f2 = s/2 and £> < 2Q, we 
conclude from Theorem 12.11 that : 



E {^(ar 1 ) - mm ieQl 0(a:) j < 0(1) ! 1 > 

where Y) is the cost function of the saddle point problem (fl"2"|) . 0(x) = max <^(:e, Y) is the 

objective in the problem of interest (fTTj) . and Opt = min 0(x) is the saddle point value in (fT2|) . or, 
equivalcntly, the optimal value in ()11|) . 

Optimizing the resulting efficiency estimate in the stepsizes 74 satisfying (|27[) , it is immediately 
seen that with 



N = floor 



V20 



= floor 



Tk 
41n(rz 



1 



1 



It = 



V2n x n Y c 2^/111(71) 



, 1 < t < T, (38) 
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the above inequality implies: 



x r ) - min 4>{x) \ < 0(1)%^ 



ln (») , rm 

— pr- + y«;In(n) 



(39) 



4 An application: minimizing the maximal eigenvalue of a 
convex combination of symmetric matrices 



Consider the special case of problem (fTTj) where Qx is the standard simplex in 

Qi = A m := 
and B = 0, c = 0, so that the problem is 



e K m : x > 0, ^ Xj = 1 



Opt = min X max (A(x)), A{x) =^ XjAj. 



[m > 2] 



(40) 



In other words, we want to minimize the largest eigenvalue of a convex combination y ■ A,- of 
given symmetric to x to matrices. Note that the problem of minimizing the maximal eigenvalue of 
B + A(x) over Qx reduces to (|4"0")) by replacing the matrices Aj with Aj + B. The operator (fT3")) 
associated with the problem is 



F(x,y) = [F a (y);F„(a;)] := [[Tt(AiY); ...;Tx(A m Y)}; -A(x)} . 



(41) 



We equip i?i = W' n with the norm || • || ^ = || • ||i; the conjugate norm is || • = || • ||oo. It is 
well known that (R m , || • W^) is K-regular with n = 0(1) ln(m) (see, e.g., Example 2.1 in Ncm04b ). 
Note that this choice of || • ||i results in 

£= max \\AA\y,* (42) 

l<j<m 

(see ([20]) ). where ||A||y* is the spectral norm of a matrix A. 
We equip Qx with the d.-g.f. function: 

m 

w x (x) = Ent(a;) := 'S^Xj hi(aJj)> 

3=1 

which, as it is well known (and immediately seen), is compatible with || • ||i. The associated entities 
are 

" 1 1 " 



x e W n : x > 0,^2^ = 1 

3=1 

and the prox-mapping is given by an explicit formula: 



Q x = ^2\n(m), (43) 



arg min {oj x (w) + (£ - oj' x (x), w)} 
. weQi 



cxp{ln(a; j ) - ^} 
Y,7=i exp{ln(a; £ ) -&}' 



1 < j < m, 
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see Section l5.3.1l 

Let us solve (|40p by T-step Algorithm [5] associated with the outlined setup and the parameters 
N and 7 chosen according to ([55]). that is, as: 

N = {l)^T t 7^7 = ; 1 (44) 

V Mn(n) 2£ A /21n(n)ln(m) 



The efficiency estimate ([39]) now reads: 

E {A max (^l(a; T )) - Opt} < 0(1) (ln(n) + ln(m)^h^^ C/T. (45) 



Choosing truncation levels J t . Let us specify the "truncation levels" Jt for 1 < t < T. In view 
of p8|). ([29)1 , (|42j) and taking into account that Vq = 0, Qy = y/2ln(n), and O x = v /21n(m). we 
conclude that: 

HVtlly,, < 0(l)VMn)/In(m)t, ||Ft|k* < 0{l)^ln(n)/\n(m)t. 

From Proposition l3.2[ we deduce that the matrix exponentials we need to use can be approximated 
with accuracy 6 -C 1 by a truncated Taylor series with J t = 0{l)y/\n{n)/ ln(m) ln(l/5)t terms. 
Specifying <5 as, say, machine accuracy, we see that "for all practical purposes" it suffices to take 

Jt = 0(l)vdn(n)/ln(m)i, t>l, (46) 

with a moderate absolute constant 0(1). 



Overall complexity. Assume that we want to solve (|40[) within accuracy e in terms of the 
objective. This task is typically unreachable with a randomized algorithm. Instead, we need to 
content ourselves with a procedure returning an e-solution with a prescribed probability of at least 
1 — /3, where < [3 <C 1. To build such a procedure, we can specify T = T(e) in such a way 
that the right hand side in (|4"5)) is at most e/4. We run the above T(e)-step algorithm k times, 
each time computing an accurate, within the margin e/2, estimate of the value A max (*4(a; T ' 4 )) of 
the objective at the corresponding output x T ' 1 , 1 < i < k, and then select among the k outputs 
x T,1 1 x T ' k the one with the smallest estimate of the objective value. Since with our choice of 
T(e) we have Prob{A max (x T ' z ) — Opt > e/2} < \ and x T,1 ,...,x T,fe are independent, this procedure 
yields an e-solution to the problem of interest with a probability of at least 1 — f3 for a "small" 
k = 0(l)ln(l//3). 

Now, let us evaluate the computational complexity of a single T(e)-step run of Algorithm [2] 
Assume that every matrix Ai has at most 5* nonzero entries. We assume that mS > n 2 , meaning 
that the matrices A{x) can be fully dense. In order to avoid intricate expressions, we omit in the 
sequel all factors that are logarithmic in m, n and 1//3 (in particular, all absolute constant factors) 
and write down the statement "P is, within logarithmic factors, bounded from above by Q" as 
P ^ Q- We also write P ~ Q when both P < Q and Q < P. Finally, we set v = e/C; this quantity 
can be naturally interpreted as the relative accuracy of an e-solution. To establish the complexity 
of our procedure, note the following. 

(A) By p5|) . the required number of steps T = T(e) admits the bound T < 1/V, whence, by (|44)) . 
N < l/u. 
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(B) As it is immediately seen, when mS > n 2 , the computational effort at step t < T of the 
algorithm is, within factor 0(1), dominated by the necessity 

1. to compute A{x) at a given point x, (< mS arithmetic operations (a.o.)); 

2. to generate N samples £| ~ Af(0, I n ) with 1 < s < N, (totally < nN a.o.); 

3. to compute for every s < N the vectors xt = Ej=o(^/2) 3 '£t A?': where V G 5™ is a given 
matrix (totally < iVtn 2 a.o., see ((35])); 



4. to build the matrix H 



EtMFxl E J ;=i X? feT ( < ^ 2 a.o.); 



-1 



5. to compute the vector [Tt(HA 1 );Tr(HA 2 ); ...; Tr(H A n )} (< mS a.o.). 

We see that the complexity of step t is < Ntn 2 + mS a.o., implying that the overall complexity of 
a single run of the algorithm is < NT 2 n 2 + TmS < n 2 jv i + mS/v a.o. We then should compute 
the value of the objective at the resulting approximate solution, that is, the maximal eigenvalue 
of a symmetric matrix with the spectral norm not exceeding C. For our purposes, it suffices to 
approximate this value (1 — P/k) -reliably within accuracy 0(e), which can be done by the Power 
method at the cost of < n 2 /v a.o. Finally, we should repeat this procedure 0(l)ln(l//3) times. 
Omitting constants and factors logarithmic in m, n, and 1//3, our randomized algorithm yields an 
(1 — /3)-rcliablc e-solution to (|4T)|) at the cost of 

n 2 mS , . n 

Csmp = — t H a.o. [v = e L\. 

v v 

Discussion. Let us compare the complexity of our algorithm with those of its existing competitors. 
To the best of our knowledge, the best existing complexity bounds for large-scale problems (|4U|) are 
as follows (we again skip logarithmic factors): 

o The complexity for Interior Point methods without any assumptions on Aj aside of their 

sparsity is 

Cipm = \Anax[n,m] [n 3 + m 3 + m 2 n 2 } In(l/i>) a.o. 

o Advanced deterministic first-order algorithms, like Nesterov's Smoothing [Ncs07j or determin- 
istic Mirror-Prox, require 

Cfom 



o We can also consider minimizing the original objective function x i— > A max ( X^ x jAj ) over 
the standard simplex using a "slightly randomized" Mirror Descent method |d'A08bj . This 
method requires 

n 2 mS 

Cmd 



2,5/2 
,2 



The iteration count in this method is ~ 1/v . The computational effort per iteration reduces 
to assembling A(x) at a given point (~ mS a.o.) and computing an e-subgradient of the 
objective and an e-approximation of the value of the objective at x by applying the Power 
method to the matrix A{x) in order to approximate its maximal singular value and leading 
eigenvector. With a straightforward implementation of the Power method this task requires 
~ n 2 jv a.o., and with an advanced implementation ~ n 2 / \/v a.o. only. 
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N 


CPU time [sec] 
mean std 95% conf 


number of iterations 
mean std 95% conf 


CPU time per iteration 
[sec/iteration] 


1 


66 


9 


[60, 72] 


2948 


327 


[2746,3151] 


0.0224 


3 


90 


14 


[81,98] 


2970 


343 


[2757, 3183] 


0.0302 


5 


86 


11 


[79,93] 


2900 


271 


[2732, 3068] 


0.0298 


10 


87 


12 


[80,94] 


2860 


207 


[2732, 2988] 


0.0305 


50 


92 


7 


[87, 96] 


2840 


232 


[2696, 2984] 


0.0323 


100 


98 


9 


[93, 104] 


2850 


207 


[2722, 2978] 


0.0344 


500 


141 


15 


[131,150] 


2860 


222 


[2722, 2998] 


0.0491 


1000 


178 


18 


[167, 189] 


2860 


222 


[2722, 2998] 


0.0622 


5000 


533 


47 


[504, 562] 


2877 


204 


[2750,3003] 


0.1851 



Table 1: CPU time (mean, standard deviation, and 95% confidence interval), number of iterations 
(mean, standard deviation, and 95% confidence interval), and average CPU time per iteration 
required by the stochastic Mirror-Prox method for solving random instances of problem (|47|) with 
parameter values n — 100, m = 100, e = 0.002, and for different samples sizes N. The matrices Aj 
have a joint sparsity pattern and, in expectation, 10% of the entries are non-zero. 



It turns out that there exists a meaningful range of values of m, n, S, and v where our stochastic 
algorithm significantly outperforms the outlined competitors. For example, consider the case when 
n is large, and assume that we have for some < k < 1/4: 

mS ~ n 13 with 2 + 2k < f3 < 3 - 2k, n max[2-/J -i/2]+« < y < n i-/3/2 

(note that the outlined range of values of v is nonempty; e.g., this range is n~ 1 ' 2+K < v < n^ 1 / 4 
with P = 2.5 ). It is immediately seen that in the case in question we have Csmp ~ n 2 /v 3 and: 

Cipm > n3K; Cfom > ^ Cmd > ^ 
Csmp Csmp Csmp 

that is, our algorithm progressively outperforms its competitors as the sizes grow. 



5 Numerical experiments 

We consider randomly generated instances of the problem 

m 

Opt = min X max (A(x)), A(x)^Y^x J A J , A, / -'<".. (47) 

i=i 

where Cj is a sparse symmetric random n x n-matrix and j = 1, . . . , m, i.e., we are confronted with 
instances of problem (|40"]) that we studied in the last section. We solve these problem instances up 
to a (relative) accuracy of 6 := eC, where < e < 1 is the target accuracy and C is defined as in 
(|4"2"j) . In all the numerical experiments that we perform, the target accuracy e is set to 2 ■ 10~ 3 . 

We implement Algorithm [2] with constant step-sizes 7t = 7, t > 1, and 7 has the form described 
in (|4"4")l . Given a matrix W £ S n , we choose the truncation level Jw of the matrix exponential 
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Taylor series approximation according to the following formula (compare with Proposition 13. 2[) : 



.7 



:{log(l/p),exp(l)||IU|| (oo) } 



1(T 



Note that this setting slightly deviates from the truncation level derived in Proposition 13.21 The 
oo-norm of W is computed approximately using the Power method. In accordance to (|54"|) and f4"5)l . 
we need to choose the sample size N as: 

N= m^M. (48) 

In Table 1, we give the CPU time (mean, standard deviation, and corresponding 95% confidence 
interval), the number of iterations needed to find a solution with relative accuracy eC (mean, 
standard deviation, and corresponding 95% confidence interval), and the average CPU time per 
iteration for different samples sizes N. The matrices Aj are all of size 100 x 100, they follow the 
same sparsity pattern, and, on average, 10% of the entries are different from zero. In total, we 
have a hundred matrices Aj. All the numerical results that we present in this paper are averaged 
over ten runs and are obtained on a computer with 24 processors, each of them with 2.67 GHz, 
and with 96 GB of RAM. We observe that the smaller the sample size the lower the CPU time 
that is required to approximately solve the problem instances. Surprisingly, we can choose a very 
small sample size without sacrificing too many iterations. Let us illustrate this observation with an 
example. According to (|48|) and with an absolute constant of 1, we are supposed to choose N as 
about 5000. With this parameter choice, we need an average CPU time of 533 seconds. Using only 
one sample for each matrix exponential approximation, we observe that we can reduce the average 
CPU time by 87.6% (with a slight increase of 2.5% in the average number of iterations). For the 
subsequent tests, we will thus choose a sample size that deviates from its theoretical value and use 
only one sample for every matrix exponential approximation. 

Given a pair (x, Y) of a primal and a dual feasible solution, we can compute the corresponding 
duality gap 

AaaxC^x)) ~ mm (A(x), Y) p , (49) 

which we use as stopping criterion for our algorithm and which we check at every 100th iteration 
of the method. The first term is approximated by an adapted version of the Power method and the 
second term is simply min{ {Aj , Y ) f '■ 1 < j < ni}. As the Power method typically returns a lower 
bound on the eigenvalue of largest absolute value, we recompute the duality gap using the Matlab 
built-in functions max() and eig() when the first approximation obtained by our version of the 
Power method yields to a value that is smaller than eC. We denote by H(V) the approximation of 
exp(V^)/Tr(exp(y)) by the truncated Taylor development. The pair (x,Y) considered at iteration 
t is the average 

1 ^ , 1 1 



Lt=1 Tt t —\ t —\ 



where x T and V T are defined in Algorithm [2] equations (|28|) . In principle, the criterion (|49|) gives 
theoretically a desirable solution only if we use exact scaled exponentials instead of H(V T ). Nev- 
ertheless, H{y) is in the matrix simplex by construction, and the number of terms we use in the 
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CPU time (mean [sec], standard deviation [sec]) 



(n,S) 


Mirror-Descent 
mean std 


det Mirror-Prox 
mean std 


stoch Mirror-Prox 
mean std 


ratios CPU time 

det MP stoch MP stoch MP 
MP) MH det MP 


(100,955) 
(200,3813) 
(400, 15255) 
(800,60971) 


307 47 
766 79 
2522 120 
10262 746 


128 16 
307 15 
1101 42 

4983 126 


71 6 
237 17 
744 39 
2814 74 


0.42 0.23 0.55 
0.40 0.31 0.77 
0.44 0.30 0.68 
0.49 0.27 0.56 



number of iterations (mean, standard deviation) and average CPU time per iteration [sec / iteration] 



(n,S) 


Mirror-Descent 
mean std CPU f timo 

iteration 


det Mirror-Prox 
mean std c t PU f timo 

iteration 


stoch Mirror-Prox 
mean std CPU t timc 

iteration 


ratios CPU time / iteration 

det MP stoch MP stoch MP 
MD MD det MP 


(100,955) 


21504 


3287 


0.0143 


3120 


349 


0.0411 


3000 


313 


0.0235 


2.88 


1.65 


0.57 


(200,3813) 


21388 


1858 


0.0358 


2700 


141 


0.1137 


2740 


171 


0.0866 


3.17 


2.42 


0.76 


(400, 15255) 


22293 


924 


0.1131 


2680 


103 


0.4109 


2620 


103 


0.2841 


3.63 


2.51 


0.69 


(800,60971) 


22327 


445 


0.4596 


2740 


52 


1.8187 


2600 


47 


1.0822 


3.96 


2.35 


0.60 



Truncation level J of Taylor series approximation (only stochastic Mirror-Prox) 



(n,S) 


mean std 


(100,955) 
(200,3813) 
(400, 15255) 
(800,60971) 


9 < 1 

9 < 0.5 

10 < 0.5 
10 < 0.5 



Table 2: CPU time (mean and standard deviation), number of iterations (mean and standard deviation), and average CPU 
time per iteration needed by the Mirror-Descent (MD), the deterministic Mirror-Prox (det MP), and the stochastic Mirror- 
Prox (stoch MP) method for solving random instances of problem (|4T|) with parameter values m = 100 and e = 0.002, with 
S non-zero entries, and for different values of the matrix size n. The performance ratios express the CPU time (CPU time 
per iteration) required by "method A" as percentage of the corresponding quantity used by "method B". The stochastic 
Mirror-Prox method is implemented with TV = 1, and the used truncation levels J are shown in the table at the bottom. 



Taylor exponential is large enough to justify a very accurate approximation, so that Y can be 
considered as an adequate approximate solution to our problem. 

In Table 2, we compare the performance of our randomized version of the Mirror-Prox method 
with the efficiency of its deterministic counterpart and of the Mirror-Descent scheme for random 
problem instances (|47p . As before, we have a hundred matrices Aj, but this time their size is 
varying. They arc sparse with a joint sparsity pattern and with S non-zero values; the values for S 
can be found in Table [5] In this table, we show the CPU time (mean and standard deviation), the 
number of iterations required to find a solution with accuracy eC (mean and standard deviation), 
and the average CPU time per iteration. Moreover, we express the average (total) CPU time and 
the average CPU time per iteration of the Mirror-Descent method (deterministic Mirror-Prox) in 
percentage of the stochastic and the deterministic Mirror-Prox method (stochastic Mirror-Prox). 
We observe that the stochastic Mirror-Prox method has an average CPU time that corresponds to 
23 to 31% of the running of the Mirror-Descent scheme and to 55 to 77% of the CPU time required 
by the deterministic Mirror-Prox method for problem instances involving matrices of size 100 x 100 
up to size 800 x 800. For the stochastic Mirror-Prox method, we also give the truncation levels J, 
which are nine or ten on average for these problem instances. 
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A Large deviations of random sums 

Let E be a Euclidean space with inner product (•, •) and let k > 1. We endow E with a norm 
denoted by ||-||, which might differ from the one associated with this inner product. 

We say that the space (E, ||-||) is K-smooth, if the function p(-) := ||-|| 2 is continuously differen- 
tiablc on E and 

p(x + y) < p{x) + (p(x), y) + np(y) V x,y G E. 

Both the space (E, ||-||) and the norm |j-|| are called K-regular, if there exist k + € and a 

norm ||-||+ on E with the following two characteristics: 

o The space (E, ||-|| + ) is K+-smooth. 

o The norm ||-||+ is /i/K + -compatible with ||-||, that is: 

INI 2 < IMI+ < — IMI 2 VxeE. 

K + 

The regularity constant ke of the space (E, ||-||) is defined as: 

ke '■= inf{«; > 1 : (E, ||-|j) is K-regular}. 

From now on, assume that {E, ||-|j) has regularity constant Ke- 

We define an 77-dimensional martingale difference sequence £i,...,£r, that is, a sequence of 
77-dimensional random vectors such that !;t-i is er(£ t )-measurable and E^ t {Ct|^t-i} = for every t. 
In our context, the following result on regular Euclidean spaces is of particular interest. 
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Theorem A.l jNemOjb_ ] Choose \ G (0, 2] and reals a\, . . . , ax > such that: 

Ef, {exp|^|f j| 6_x} <exp{l} Vt = l,...,T. 
fa) for aZZ c > 0, we have: 





T 




T 






> c. 










t=l 



<C x exp(-c*/C x ), 



where C x > 2 is a properly chosen constant that solely depends on x md that is continuous 
in x- 

(b) With a properly chosen constant c x > that solely depends on x and that is continuous in x, 
we have: 



E C[T1 S CX P 



„2 



< exp{l} 



For a proof of the first part of the above theorem, we refer to Ncm04b . Statement b) follows 
from a) by integration. 



B Proofs 

B.l Proof of Theorem EH] 

The proof of Theorem 12.11 requires a result from j JNT08] . which we reproduce below. 

Lemma B.l [JNT08] Let z e Q°, 7 > 0, and 77, £ £ E. Consider the points 

x := argmin {(777-0/(2), 7/) +uj(y)} 
z + := argmin{(7C- u'(z),y) + uj(y)}. 

Then, 

( 7 C,ar- y) < V z (y) - V z+ (y) + ^ \\ V - C\\l - \ IN - z\\ 2 V y e Q. 



Let us show Theorem 12. II 
Proof 

We can represent the random elements F(z t -i) and F(wt) as follows: 

F(zt-i) = F&^zt-i) - cr Zt _ 1 - fi Zt _ 1 and F(w t ) = F^ 2t (w t ) - a Wt - fi Wt , 
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respectively. Let u £ Q. As F is a monotone operator, we have: 

T T T 

Y It (F(u), z T - u) = ^2 It (F(u), w t - u) < ^2 It (F(w t ), w t - u) 



t=i 

T 



7t ( F t2 t ( w t) ~ Vw t - Hw t ,w t -u\ . (50) 



By Theorcm lB.il we obtain: 



T / 2 
It 



Furthermore, 



2 1 



ii 1 1 2 

< 2 (\\F{w t ) - F(z t _i)|| 2 + \\a Wt - cr^ + /i„ t - fi Zt _ 1 \\ 2 ^j 



F n 2t {wt) - F^^zt-i] 



<2[L \\w t — zt-x\ + || <J Wt - Vzt-! + yiw t - Mzt-i | 

where the concluding inequality is due to the Lipschitz continuity of F. Observe that 7 2 £ 2 < \ 
because of the step-size choice ((8j). Thus, 

T T 
^7t(F i2t (w t ),Wt-u) < V z »{u) + Y2lt \\°w t -<Tzt-i +Vw t -Ma t -i||! 



t=l 



t=l 



< V + (H^« - H« + H^*+A*. t -ill2) • (51) 



1=1 



Additionally, the following inequality holds: 

T T 

y^7t {<Tvi t + Vw t ,u - w t ) = ^It ((cr Wt ,u- z u ) + (a Wt ,z u - w t ) + (fx Wt ,u- w t )) 



< n 



J2"/tV Wt 



t=i 



Y^lt {{ow t ,z u -w t )+D ||/i Wt ||J . 



We combine the above inequality with (|50)) and ([ST]) : 

T n 2 T 

Y,lt(F(u),z T -u) < + 2^7^(11^-^-! II* + II*) 



t=i 



-9. 



5^7t {{°w t ,z u - io t ) +-D UmiuJJ 



Maximizing the left-hand side of the above inequality with respect to u £ Q and taking expectations 
on both sides, we obtain: 

]r 7t E? [2T] {e(z T )} < ^ + 2^7 2 E 5[2T] {\\a Wt - a Zt _, || 2 + ||^ + fx^ || 2 } 
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'£[3T] 



T 



t=l 



+ I^7tE 5[2T] {K,*,^ — u>t) + -D || Muit ||J . 



It remains to observe that JE£ [2r] {{o~w t ) £ w — w t )} = 0, as is a martingale difference. 
Assume that F is associated with the saddle-point problem ([3]), that is: 



F(x,y) 



d(f>{x,y) _ d(/)(x,y) 
dx dy 



Recall that t € {1, ...,T}. Let w t = (xt,yt) G Q := Qi x Q 2 , u = {u x ,u y ) e Qi x Q 2 , and 
At : = 7t/ X)i=i 7*' As ^ ne function </> is convex in the first and concave in the second argument, we 
have: 

T T 

X A t (F(w t ), w t -u) > X A t (cj)(xt,yt) - <P{u x , y t ) + <p(x t ,u y ) - <p(x t ,y t )) 
t=i t=i 

T 



> 



t=l 

I T \ / T > 

X ^txt,u y -cf)lu x ,^2 A*2/t 



It remains to apply the same arguments as above in order to complete the proof. ■ 

B.2 Proof of Proposition 13.11 

For V e 5„, we define 

g(V) := h(V) + c, where := X ■ (52) 

Recall that . . . , £ N are independent Af(0, I n )-distributed random vectors and £ = . . . , ^ Ar ). 
Let ff{(^) be defined as in (j2"3"|) . that is: 

ge{V):=ht(V)+c, where h ( (V) := A* (53) 



with 



We start with the observation: 

Assumption B.l .As </ie standard multivariate normal distribution Af(0,I n ) is orthogonal invari- 
ant, and as both h(V) and fi£ (V) are invariant under positive scaling, we can assume, without loss of 
generality, that exp{V} is diagonal (with positive diagonal entries) and of trace 1. This assumption 
shall hold for the rest of this section. 



24 



For i = l,..., N, we set: 

1 N 

D e := (exp{F/2}f ) (exp{F/2}r) T - exp{^}, d e := A*D e , d e := - ]T d 

i=l 

and: 

1 W 

A == [f] T exp{ne-l, /e:=^E/C- 

Lemma B.2 For an appropriately chosen constant c > 0, we /iaue /or any i = 1,. . . , iV.: 
aj = Epdp = E^ = E e f e = E 6 f 6 = 0. 

b) Ee{exp{^*}}<cxp{l}. 

c) E^{exp{^^}}<exp{l}. 

d) E e {cxp{^|-}}< e xp{l}. 

e) E 5l {exp{^}} <exp{l}. 

f) E e {cxp{^M}}<cxp{l}. 
Proof 

Assume that dia<7(exp{y}) = (vi, . . . , v n ) T , where u % > for any i = 1, . . . ,n and J^ILi w » 

a) Let i E {1, . . . ,n}. We have: 

E e D e = exp{V/2}E e {?[?] T } cxp{Vy2} - exp{^} = 0, 
where the concluding equality holds as £ 2 ~ A/*(0, / n ). Moreover, 

n 

E ( .{ie} T cxp{v}e} = J2 v x = 1 > 

k=l 

which proves Ej«/^« = 0. The remaining equalities follow immediately. 

b) Assume that the n-dimensional random vector £ is A/"(0, I„)-distributed. Then, 

n 

= E^- 

i=l 

For any < ci < -__tEl__> (< i) ; it holds that: 

n 

} = rjE c exp{ ClWl C 2 } 

i=i 
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(exp{F/2}C) (exp{F/2}C) T 



E c cxp {d (exp{F/2}C) (cxp{F/2}C)' 



Tl 

= H(i-2 Cl v t r l/2 

i=l 

= exp j-^^ln(l-2ciUi) j 



< exp|--ln(l-2 Cl ) 

< cxp{l}, (56) 

where the first inequality holds as the maximum of — Y^i=i m 0- ~ 2ciUj) over the probability 
simplex is attained at an extreme point (note that the function v n- — E™=i m (1 ~ 2cii>,) is 
separable and each of its components is convex). We obtain: 

cxp{l} > Cl E f ||(exp{F/2}C) (exp{^/2}C)' 

Using Jensen's inequality, it follows that: 

E c exp{^-|^|| XX T -E cXX T || y J <exp{l}, X := exp{U/2}C. 

We conclude by observing that (exp{F/2}C) (cxp{l//2}C) T - E c (exp{V/2}C) (exp{U/2}C) T 
has the same distribution than any of the matrices D^i, i = 1, . . . , N. 



c) Let i = 1, . . . , N and < c 2 < n — C1 m ■ Due to a), we obtain: 

' 1 ' z — 2cxp{l} '* 



E^i exp 



c 2 \\de 



C 



— > = E^i exp 



< E^i exp ■ 



C2 LDfl 



< exp{l}. 



d) According to Theorem IA.1I (note that we apply Theorem I A. 1 1 with X = 1 and <Ti = C/c2 for 
any i = 1, . . . , N), there exists C3 > such that: 



E^ exp 



C2VN\\ds\\ x . 



< exp{l}. 



e) Here, the n-dimensional random vector £ is 7V(0, I„)-distributed. Due to ([53]) and (|56"|) . we 
obtain: 



E, 



expjy |C T exp{F}C-l|} <cxp|i| ^E c exp j Cl < cxp{l}, 



where < c\ < 1 cx p{ 2 } _ jt remains to note that C T exp{y}C— 1 has the same distribution 
as any of the random variables fpi, i — 1, . . . , N. 

f) We observe that the space (R, IH^) has a regularity constant of 1. Due to Theorem lA.il (we 
apply Theorem I A. 1 1 with X = 1 and cr, = 2/ci for any i = 1, . . . , iV), there exists a constant 
c 4 > such that: 



E^ exp 



[ciVa 7 |/ e |l 


> = E5 exp < 






1 


I 2c4 


2c 4 \/iV J 



< exp{l}. 
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It remains to choose c > max{2exp{l}/ci,C3/c2, 1c^jc\\. ■ 
We are ready to prove Proposition 13. II 

Proof 

Let V <E S n and recall definitions (|52p - (|M)l . Consider the random element: 

& := d £ - feh(V). 

Lemma TB.2I implies = 0. Moreover, by the same lemma, there exists a constant c\ > such 

that: 



E £ exp ■ 



< exp ■ 



N\\dt\[ 



exp 



N\\fth(V)\\ 









1/2 


E^ exp / 


< 


E^ exp | 









^ II AWL, 



< exp 



Ej exp 



c\Jk 



1/2 



< exp{l}, 

where we use Holder's inequality and the facts ||/i(V)|| x * — & anc ^ k>1. Let 

74 := /i f (V) - & - 
As £ (V) = / 5 + 1 and .A* (G e (V)) =d s + h(V), it holds that: 

A* (Gs(V)) _ d i + h(V) 



We obtain: 



d s + h{V) 



Consider the sets: 
When £ € II, we have: 



n := U : |/d < 1/2} 



- d e + 



l7«L, 



More generally, it holds that: 



flKv) - de/ 4 



and II:=M nx7V \n. 

^ <2||/|ft(v)-de/ e || i 



ll+/d 

< 2 (|/ { | 2 £ + Udells) <\h\c + \\^ 



nil,* < \\k(v)\\^ + \\m x ,, + \\Kv)\\^ 

< II^WIU + IMdU + ll/«WIU + IIWIU* 
- IIM^L. + lldelL. + IAIIIWIU. + IIWL,, 



1/2 



(57) 



(58) 
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< ( 2 + \f i \)C + \\d e \\ x ^ 
which is due to ||/i(V)|| t < C and to the following inequality: 

YZi (exp{V/2}f*) (exp{^/2}C) T 

\\he(y)\L. < £ 



c. 



Furthermore, denoting by P the probability measure of the random matrix £: 



n 



'[|/d>V2] = 



exp ■ 



< exp • 

< exp • 



'N 
~2c7 



Ej exp ^ ■ 
cxp{l}, 



Cl 
Cl 



> exp ■ 



'N 
~2c~[ 



where the inequalities follow from Markov's inequality and from statement e) of Lemma IB. 21 
spectively. Choose C2 > 4ci and observe: 



E £ exp ■ 



3c 2 £\/k 



exp ■ 



N\ht\\ x , 



exp 



N\hs\\ x , 



dP(0. 



By (|58|). Cauchy-Schwarz inequality, and Lemma IB. 21 we obtain: 



exp 



3c 2 Cy/n 



dF(£) < / exp 



Jv(|/ € |£ + ||d f | 



< 1 



E^ exp ■ 



1/3 



E £ exp ■ 



AMI* 



1/3 



Additionally, by (|59|) . we have: 



exp • 



JV 2£+|/ f |£ + ||d £ || 



Let 



.4 



< exp \ — > exp < — > = exp < — 



dP(0 < / exp 



exp 



VN(\ft\£ + \\dt 



Cauchy-Schwarz inequality, bound (|60p . and Lemma IB . 2 1 imply: 



dP(f). 



A < 



idP(0 



1/3 



exp ■ 



C^Jk 



"(0 



1/3 



exp 



N\\d ( \\ 2 



dP(0 



1/3 
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< 



< 



n 



n 



1/3 



1/3 



exp < 




f 


1/3 


E^ exp < 




\] 








c 2 C^/k 




exp < 






1/3 


E^ exp < 






I «* J 


[ 




C 2 Cy/K 


>} 



1/:! 



n i/3 



< exp 



' N 
6ci 



exp 



As C2 > 4ci, we obtain: 



Ee exp 



7>c 2 tLsJ~K 



< exp < - 



(21 


• + exp < 


f 2VN) 


> exp | 


Uj 




(21 


• + exp < 






13J 


[ 3c 2 


6ci J 



exp 



— — > exp < - > < 2 exp < - 



>N 
6ci 



(61) 



Thus, 
E^ exp ■ 



N\\h^v)-hiy)\\ 

6c 2 £y/K 



JiL^ exp ^ 
E^ exp 
Ej exp 



< 



< 



Well 



i V2 r 



3c2£v^ 



i V2 r 



Ejr exp 
E^ exp 



3c 2 £,/k 
3c 2 Cy/n 



1/2 



1/2 



< \/2cxp <; - f- 



(62) 



where the inequalities are due to Holder's inequality, the fact that c 2 > ci, bound (|57)l . and 
inequality (j6"Tj). respectively. 

It remains to find an appropriate bound on the norm of h(V) — K^h^(V). Recall that Ej/3^ = 0. 
Therefore, 

mh(y)-h(v)\\ Xt , = \\E a d x ,* <%hflU 

= / Il7fll a . dp (0+ / INU^O 

< 2 f \f s \ 2 c + \f s \\\d s \\ Xt ,dP(o+ L WnLjnO, 

Jn Jn 

where the concluding inequality follows from As 2exp(x) > x 2 for any x > 0, we obtain by 

Lemma IB. 21 



2£ / |/ £ | 2 dP(0 < 2£E € |/ ? | 
7n 



2 4cf£ 
2 <^E 5 exp. 



< 4exp{l}c 2 £ 



Cl 



iV 
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Furthermore, the same arguments imply: 

2/ |/ f | IKII^ dP(0 < / ^£|/e| 2 dP(e) + 



I dell 



< E c {V^£|/ e | 2 } 
2c 2 £^ 



< 



< 



N 



E^ < exp 



n 

df\\ 2 

N\h\\ 



-dP(0 



Wild 



Cl 



< exp 



s iia: * 



4exp{l}c 2 £ v / '* 
TV ' 



Finally, 



beiu^e) < 











1/2 




< 


p 


ri 






Eehell^' 



1/2 













n 








[3" 


4ci ^ 





E« hell 



TV 



1/2 



-Ee exp 



^ hell 



3c2>Cv / '* 



1/2 



6exp{§} c 2 £yfK 
< exp 



. TV 

6exp {§} 
< Lil == exp 



'N 
4ci 

C2 



< 



6exp{|}c^v^ 



TV 



where the inequalities hold due to Holder's inequality, the fact that hellj.,, is nonnegative for 
any £ £ M. nxN , bound (p0"|) . the fact that 2exp(x) > x 2 for any a; > 0, inequality (pjTj) . and the 
assumption C2 > 4ci, respectively. We obtain: 



\\E^(V)-h(V)\\ x . < 



< 



< 



4cxp{l}c 2 £ 4exp{l}c 2 /y^ 6cxp{l}c|£^ 

N + N + N 

8exp{l}c 2 £y / K &cxp{l}c\C^fTi 

N + N 

explljc^v^ Gexpjljc^v 7 '* 

2N + 
lScxpjljc^v^ 

2N ' 



N 



which proves statement a). The above inequality ensures together with bound (|62 



E^ exp 



N\\ht(V)-Ezhs(V)\\ Xtt 



< \/2c> i g 



13 exp{l}c2 



12VN 
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J 5 13exp{l}c 2 
< V2cxp<j - + 



Let: 

5 ln(2) 13exp{l}c 2 
C3 := 6 + ^ + 12 ' 

By Jensen's inequality, we conclude that: 



VN\\h,(V)-E,h,(V)\\ x ^ 
E^exp^ —= - \ < exp{l}. 
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